Crate regex_test


A crate for defining tests in a TOML format and applying them to regex engine implementations.

Generally speaking, if you aren’t writing your own regex engine and looking to test it, then this crate is probably not for you. Moreover, this crate does not come with any actual tests. It merely defines the test format and provides some convenient routines for executing tests within the context of Rust unit tests.

Format

The entire test corpus is derived from zero or more TOML files. Each TOML file contains zero or more tests, where each test is defined as a table via [[test]].
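
As a rough illustration of that layout, here is a minimal sketch of one test file holding two tests. The file name `basic.toml` and the test names are hypothetical, chosen only for this example; the fields themselves are the ones described in the list that follows.

```toml
# Hypothetical file: basic.toml, so the group name would be "basic".

# Each [[test]] header starts a new test table.
[[test]]
name = "literal"
regex = 'abc'
haystack = "xxabcxx"
# One match covering bytes 2 (inclusive) to 5 (exclusive).
matches = [[2, 5]]

[[test]]
name = "no-match"
regex = 'abc'
haystack = "xyz"
# An empty list means no matches are expected.
matches = []
```

Under these assumptions, the full names of the two tests would be basic/literal and basic/no-match.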

Each test has the following fields:

  • name - A name for the test. It must be unique within its file. A test’s RegexTest::full_name is derived either via {group_name}/{name} or {group_name}/{name}/{additional_name}, with the latter being used only when TestRunner::expand is used. The group_name is derived from the file stem (the file name without the .toml suffix).
  • regex - The regex to test. This is either a string or a (possibly empty) list of regex patterns. When using a list, the underlying regex engine is expected to support multiple patterns, where each pattern is identified by an ID that starts at 0 and increments by 1 for each subsequent pattern.
  • haystack - The text to search.
  • bounds - An optional field whose value is a table with start and end fields, whose values must be valid for the given haystack. When set, the search will only execute within these bounds. When absent, the bounds correspond to start = 0 and end = haystack.len().
  • matches - Zero or more match values. Each match value can be in one of four formats:
    • A simple span, i.e., [5, 12], corresponding to the start and end of the match, in byte offsets. The start is inclusive and the end is exclusive. The pattern ID for the match is assumed to be 0.
    • A table corresponding to the matching pattern ID and the span of the match. For example, { id = 5, span = [20, 21] }.
    • A list of capture group spans, with the first corresponding to the overall match and the pattern ID assumed to be 0. For example, [[5, 10], [6, 8], [], [9, 10]], where [] corresponds to a group present in the regex but one that did not participate in a match.
    • A table corresponding to the matching pattern ID and a list of spans corresponding to the capture groups. For example, { id = 5, spans = [[5, 10], [6, 8], [], [9, 10]] }. This is the most general, but also most verbose, syntax.
  • match-limit - An optional field that specifies a limit on the number of matches. When absent, no limit is enforced and all matches should be reported by the regex engine. This can be useful, for example, when one only cares about the first match.
  • compiles - An optional field indicating whether the regex is expected to compile. It defaults to true when absent. When true, if the regex does not compile, then the test fails. Conversely, when false, if the regex does compile, then the test fails.
  • anchored - Whether to execute an anchored search or not. Note that this is not the same as adding a ^ to the beginning of your regex pattern. ^ always requires the regex to match at position 0, but an anchored search simply requires that the regex match at the starting position of the search. (The starting position of the search can be configured via the optional bounds field.)
  • case-insensitive - Whether to match the regex case insensitively. This is disabled by default. There is no real difference between using this field and adding a (?i) to the beginning of your regex. (Some regex engines may not support (?i).)
  • unescape - When enabled, the haystack is unescaped. Sequences like \x00 are turned into their corresponding byte values. This permits one to write haystacks that contain invalid UTF-8 without embedding actual invalid UTF-8 into a TOML file (which is not allowed). There is generally no other reason to enable unescape.
  • unicode - When enabled, the regex pattern should be compiled with its corresponding Unicode mode enabled. For example, [^a] matches any UTF-8 encoding of any codepoint other than a. Case insensitivity should be Unicode aware. Unicode classes like \pL are available. The Perl classes \w, \s and \d should be Unicode aware. And so on. This is an optional field and is enabled by default.
  • utf8 - When this is enabled, all regex match substrings should be entirely valid UTF-8. While parts of the haystack the regex searches through may not be valid UTF-8, only the portions that are valid UTF-8 may be reported in match spans. Importantly, this includes zero-width matches. Zero-width matches must never split the UTF-8 encoding of a single codepoint when this is enabled. This is an optional field and is enabled by default.
  • line-terminator - This sets the line terminator used by the multi-line assertions (?m:^) and (?m:$). It defaults to \n. It must be exactly one byte. This field is automatically unescaped in order to permit a non-ASCII byte.
  • match-kind - May be one of all, leftmost-first or leftmost-longest. See MatchKind for more details. This is an optional field and defaults to leftmost-first.
  • search-kind - May be one of earliest, leftmost or overlapping. See SearchKind for more details. This is an optional field and defaults to leftmost.
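
The fields above can be sketched in a handful of tests. This is an illustrative fragment, not taken from any real test corpus: the test names are hypothetical, and the expected matches assume an engine with conventional leftmost-first semantics that reports spans as byte offsets into the full haystack.

```toml
# Multiple patterns: IDs are assigned from 0 in list order.
[[test]]
name = "multi-pattern"
regex = ['[a-z]+', '[0-9]+']
haystack = "abc 123"
matches = [
  { id = 0, span = [0, 3] },
  { id = 1, span = [4, 7] },
]

# Capture group spans: the first span is the overall match, and []
# marks a group that did not participate in the match.
[[test]]
name = "captures-with-unmatched-group"
regex = '(a)|(b)'
haystack = "b"
matches = [[[0, 1], [], [0, 1]]]

# An anchored search must match at the start of the search, which
# bounds moves to offset 1 here.
[[test]]
name = "anchored-within-bounds"
regex = 'b+'
haystack = "abbb"
bounds = { start = 1, end = 4 }
anchored = true
matches = [[1, 4]]

# unescape turns \xFF into the single byte 0xFF, which could not be
# written directly in a TOML file.
[[test]]
name = "invalid-utf8-haystack"
regex = 'a'
haystack = 'a\xFFa'
unescape = true
matches = [[0, 1], [2, 3]]

# Only the first match is checked.
[[test]]
name = "first-match-only"
regex = 'a'
haystack = "aaa"
match-limit = 1
matches = [[0, 1]]

# The pattern is expected to be rejected at compile time.
[[test]]
name = "unclosed-group"
regex = '(a'
haystack = ""
compiles = false
matches = []
```

The regex and the escaped haystack use TOML literal strings (single quotes) so that backslashes reach the test harness unchanged.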

Structs

  • Captures - Represents a single group of captured matches from a regex search.
  • CompiledRegex - The result of compiling a regex.
  • Match - A single match, consisting of the pattern that matched and its span.
  • RegexTest - A regex test describes the inputs and expected outputs of a regex match.
  • RegexTests - A collection of regex tests.
  • RegexTestsIter - An iterator over regex tests.
  • Span - A span of contiguous bytes, from start to end, represented via byte offsets.
  • TestResult - The result of executing a regex search.
  • TestRunner - A runner for executing regex tests.

Enums

  • MatchKind - The match semantics to use for a search.
  • SearchKind - Represents the type of search to perform.